Body Fat Percentage Presentation

Group 013E01

Joanne Lim, John Fu

Data Description

  • Accurate body fat measurement is vital but often inconvenient and expensive.
  • Can we use simpler measurements like height, weight, and circumferences for estimates?
  • The data set bodyfat contains body fat percentages and other related measurements for 250 men. These measurements, which include height, weight and various body circumferences, were collected to explore alternatives to underwater body fat assessments.
Rows: 250
Columns: 16
$ Density <dbl> 1.0708, 1.0853, 1.0414, 1.0751, 1.0340, 1.0502, 1.0549, 1.0704…
$ Pct.BF  <dbl> 12.3, 6.1, 25.3, 10.4, 28.7, 20.9, 19.2, 12.4, 4.1, 11.7, 7.1,…
$ Age     <int> 23, 22, 22, 26, 24, 24, 26, 25, 25, 23, 26, 27, 32, 30, 35, 35…
$ Weight  <dbl> 154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 181.00, 176.00…
$ Height  <dbl> 67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 72.50, 74.00,…
$ Neck    <dbl> 36.2, 38.5, 34.0, 37.4, 34.4, 39.0, 36.4, 37.8, 38.1, 42.1, 38…
$ Chest   <dbl> 93.1, 93.6, 95.8, 101.8, 97.3, 104.5, 105.1, 99.6, 100.9, 99.6…
$ Abdomen <dbl> 85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 82.5, 88.6, 8…
$ Waist   <dbl> 33.54331, 32.67717, 34.60630, 34.01575, 39.37008, 37.16535, 35…
$ Hip     <dbl> 94.5, 98.7, 99.2, 101.2, 101.9, 107.8, 100.3, 97.1, 99.9, 104.…
$ Thigh   <dbl> 59.0, 58.7, 59.6, 60.1, 63.2, 66.0, 58.4, 60.0, 62.9, 63.1, 59…
$ Knee    <dbl> 37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38.3, 41.7, 39…
$ Ankle   <dbl> 21.9, 23.4, 24.0, 22.8, 24.0, 25.6, 22.9, 23.2, 23.8, 25.0, 25…
$ Bicep   <dbl> 32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 31.9, 30.5, 35.9, 35.6, 32…
$ Forearm <dbl> 27.4, 28.9, 25.2, 29.4, 27.7, 30.6, 27.8, 29.0, 31.1, 30.0, 29…
$ Wrist   <dbl> 17.1, 18.2, 16.6, 18.2, 17.7, 18.8, 17.7, 18.8, 18.2, 19.2, 18…

Null and Full Model


Call:
lm(formula = Pct.BF ~ ., data = bodyfat)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3746 -0.3725 -0.1157  0.2358 15.0629 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.494e+02  1.154e+01  38.961   <2e-16 ***
Density     -4.098e+02  8.384e+00 -48.876   <2e-16 ***
Age          1.395e-02  9.721e-03   1.435    0.153    
Weight       1.527e-02  2.015e-02   0.758    0.449    
Height      -1.558e-02  5.752e-02  -0.271    0.787    
Neck        -1.653e-02  7.084e-02  -0.233    0.816    
Chest        1.790e-02  3.259e-02   0.549    0.583    
Abdomen      1.833e-02  3.286e-02   0.558    0.578    
Waist               NA         NA      NA       NA    
Hip          2.537e-02  4.391e-02   0.578    0.564    
Thigh       -2.107e-02  4.421e-02  -0.476    0.634    
Knee        -1.657e-02  7.366e-02  -0.225    0.822    
Ankle       -8.160e-02  6.616e-02  -1.233    0.219    
Bicep       -5.256e-02  5.132e-02  -1.024    0.307    
Forearm      1.405e-02  6.229e-02   0.225    0.822    
Wrist       -1.883e-02  1.640e-01  -0.115    0.909    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.276 on 235 degrees of freedom
Multiple R-squared:  0.9777,    Adjusted R-squared:  0.9763 
F-statistic: 734.4 on 14 and 235 DF,  p-value: < 2.2e-16
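
The "1 not defined because of singularities" note and the NA row for Waist suggest that Waist is an exact linear function of another predictor. Judging from the values in the data preview, Waist appears to be Abdomen converted from centimetres to inches. This is an inference from the printed rows, not something stated in the data description. A minimal check in R, using only the first five values of each column shown in the preview:

```r
# First five Abdomen and Waist values copied from the data preview
abdomen_cm <- c(85.2, 83.0, 87.9, 86.4, 100.0)
waist      <- c(33.54331, 32.67717, 34.60630, 34.01575, 39.37008)

# Assumption being tested: Waist is Abdomen expressed in inches (1 in = 2.54 cm)
waist_from_abdomen <- abdomen_cm / 2.54

# TRUE if the two columns agree up to the rounding in the printout
isTRUE(all.equal(waist, waist_from_abdomen, tolerance = 1e-6))
```

If the check holds for the whole column, Waist carries no information beyond Abdomen, which is why lm() cannot estimate a separate coefficient for it.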

Null and Full Model

metric          M0          M1
r.squared       0.98        0.00
adj.r.squared   0.98        0.00
sigma           1.28        8.29
statistic       734.37
p.value         0.00
df              14.00
logLik          −407.98     −883.12
AIC             847.96      1,770.23
BIC             904.31      1,777.28
deviance        382.77      17,128.82
df.residual     235.00      249.00
nobs            250.00      250.00
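
As a sanity check on this table, AIC and BIC can be recomputed from the log-likelihoods with the standard formulas AIC = -2 logLik + 2k and BIC = -2 logLik + k log(n). The sketch below assumes k counts every estimated regression coefficient plus the error variance, which is how R counts parameters for lm fits. The helper names are hypothetical.

```r
# Hypothetical helpers implementing the standard AIC and BIC formulas
aic_from_loglik <- function(loglik, k) -2 * loglik + 2 * k
bic_from_loglik <- function(loglik, k, n) -2 * loglik + k * log(n)

n <- 250

# Full model: intercept + 14 estimated slopes + error variance = 16 parameters
full_aic <- aic_from_loglik(-407.98, k = 16)
full_bic <- bic_from_loglik(-407.98, k = 16, n = n)

# Null model: intercept + error variance = 2 parameters
null_aic <- aic_from_loglik(-883.12, k = 2)
null_bic <- bic_from_loglik(-883.12, k = 2, n = n)

round(c(full_aic = full_aic, full_bic = full_bic,
        null_aic = null_aic, null_bic = null_bic), 2)
```

The results agree with the table to within about 0.01; the small differences come from the log-likelihoods being rounded to two decimals.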

Backward stepwise selection


Call:
lm(formula = Pct.BF ~ Density + Age + Abdomen, data = bodyfat)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2913 -0.3576 -0.0911  0.2319 15.4601 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.424e+02  8.738e+00  50.626  < 2e-16 ***
Density     -4.065e+02  7.279e+00 -55.844  < 2e-16 ***
Age          1.182e-02  6.579e-03   1.796   0.0737 .  
Abdomen      5.761e-02  1.332e-02   4.326 2.21e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.26 on 246 degrees of freedom
Multiple R-squared:  0.9772,    Adjusted R-squared:  0.9769 
F-statistic:  3513 on 3 and 246 DF,  p-value: < 2.2e-16

Forward stepwise selection


Call:
lm(formula = Pct.BF ~ Density + Abdomen + Age, data = bodyfat)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2913 -0.3576 -0.0911  0.2319 15.4601 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.424e+02  8.738e+00  50.626  < 2e-16 ***
Density     -4.065e+02  7.279e+00 -55.844  < 2e-16 ***
Abdomen      5.761e-02  1.332e-02   4.326 2.21e-05 ***
Age          1.182e-02  6.579e-03   1.796   0.0737 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.26 on 246 degrees of freedom
Multiple R-squared:  0.9772,    Adjusted R-squared:  0.9769 
F-statistic:  3513 on 3 and 246 DF,  p-value: < 2.2e-16
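
The two selection runs can be outlined with base R's step(), which adds or drops terms according to AIC. The sketch below runs on simulated stand-in data, because the bodyfat data frame itself is not included here. The column means and spreads, the extra Height noise column, and the coefficients used to simulate the response are illustrative assumptions, so its numbers will not match the slides.

```r
set.seed(1)
n <- 250

# Simulated stand-in for bodyfat (NOT the real data)
sim <- data.frame(
  Density = rnorm(n, mean = 1.055, sd = 0.019),
  Age     = sample(22:80, n, replace = TRUE),
  Abdomen = rnorm(n, mean = 92, sd = 10),
  Height  = rnorm(n, mean = 70, sd = 3)  # unrelated to the response by construction
)
sim$Pct.BF <- 442.38 - 406.49 * sim$Density + 0.01 * sim$Age +
  0.06 * sim$Abdomen + rnorm(n, sd = 1.26)

null_fit <- lm(Pct.BF ~ 1, data = sim)  # intercept only
full_fit <- lm(Pct.BF ~ ., data = sim)  # all predictors

# Backward: start from the full model and drop terms while AIC improves
step_back <- step(full_fit, direction = "backward", trace = 0)

# Forward: start from the null model and add terms while AIC improves
step_fwd <- step(
  null_fit,
  scope = list(lower = ~ 1, upper = ~ Density + Age + Abdomen + Height),
  direction = "forward",
  trace = 0
)

# Compare which terms each direction kept
names(coef(step_back))
names(coef(step_fwd))
```

Backward and forward selection need not agree in general; on the slides they happen to reach the same three predictors.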

Summary of all models

                  Full               Backward           Forward
Predictors        Estimates  p       Estimates  p       Estimates  p
(Intercept)       449.43     <0.001  442.38     <0.001  442.38     <0.001
Density           -409.76    <0.001  -406.49    <0.001  -406.49    <0.001
Age               0.01       0.153   0.01       0.074   0.01       0.074
Weight            0.02       0.449
Height            -0.02      0.787
Neck              -0.02      0.816
Chest             0.02       0.583
Abdomen           0.02       0.578   0.06       <0.001  0.06       <0.001
Hip               0.03       0.564
Thigh             -0.02      0.634
Knee              -0.02      0.822
Ankle             -0.08      0.219
Bicep             -0.05      0.307
Forearm           0.01       0.822
Wrist             -0.02      0.909
Observations      250                250                250
R2 / R2 adjusted  0.978 / 0.976      0.977 / 0.977      0.977 / 0.977
AIC               847.962            831.066            831.066

Model selection conclusion

  • Higher R-squared is generally better, but R-squared never decreases when predictors are added, so an R-squared well above the adjusted R-squared can signal overfitting.
  • Lower AIC values indicate a better trade-off between fit and model complexity.
  • Backward and forward selection arrive at the same model. It has the lowest AIC (831.07 versus 847.96 for the full model), and its R-squared and adjusted R-squared (both 0.977) are essentially the same as the full model's (0.978 and 0.976), so we use the backward model for further interpretation.

Assumption checking: Pct.BF vs Age

  • The plot on the left shows the relationship between Age and Body Fat Percentage. It suggests a general trend in which body fat percentage increases with age.
  • In the plot of Pct.BF against Age, one person has a Pct.BF of 0, and the age groups are not evenly represented in the data set, which may distort the analysis. However, the residuals show no specific pattern and are spread roughly evenly above and below the central line across the range of fitted values, so the linearity and homoskedasticity assumptions are reasonably met.

Assumption checking: Pct.BF vs Density

  • The relationship between Density and Body Fat Percentage appears to be strong and linear. As Density goes up, Pct.BF goes down, and vice versa. The data points closely follow the regression line.
  • The residuals in the right-hand plot show no discernible pattern, which supports the linearity assumption.

Assumption checking: Pct.BF vs Abdomen

  • There seems to be a fairly strong positive linear correlation between Abdomen and Body Fat Percentage. As abdomen size increases, Pct.BF also tends to go up.
  • The spread of residuals is roughly even above and below the central line and across the range of fitted values.

Assumption checking for the model (1)

  • Linearity: The residuals seem to be randomly scattered around the horizontal line. There are a few outliers (points labeled with numbers), but they don’t appear to form a systematic pattern. Overall, the assumption of linearity seems to be reasonably met.

  • Homoskedasticity: The residuals don’t appear to be fanning out or changing their variability over the range of the fitted values so the constant error variance assumption is met.

Assumption checking for the model (2)

Independence

  • Independence of the error terms is crucial and is typically addressed at the experimental design stage, i.e. before data collection. Each observation comes from a different man, and one man's measurements do not inherently influence another's, so we can treat the errors as independent.

Assumption checking for the model (3)

Check Normality

  • Apart from three points in the upper tail and one point in the lower tail, the majority of points lie quite close to the line in the QQ plot. Hence, the normality assumption for the residuals is reasonably well satisfied.

  • Additionally, we have a fairly large sample size (n = 250), so we can also rely on the central limit theorem to give us approximately valid inferences.
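
The residual-versus-fitted and QQ plots referred to on these slides can be drawn with base R graphics. A minimal sketch on simulated stand-in data follows; the real bodyfat data frame is not included here, and the simulation settings are illustrative assumptions.

```r
set.seed(42)
n <- 250

# Simulated stand-in for bodyfat (NOT the real data)
sim <- data.frame(
  Density = rnorm(n, mean = 1.055, sd = 0.019),
  Age     = sample(22:80, n, replace = TRUE),
  Abdomen = rnorm(n, mean = 92, sd = 10)
)
sim$Pct.BF <- 442.38 - 406.49 * sim$Density + 0.01 * sim$Age +
  0.06 * sim$Abdomen + rnorm(n, sd = 1.26)

fit <- lm(Pct.BF ~ Density + Age + Abdomen, data = sim)
res <- residuals(fit)

# Residuals vs fitted: look for curvature (linearity) and fanning (homoskedasticity)
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal QQ plot: points near the line support the normality assumption
qqnorm(res)
qqline(res)
```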

Final fitted model


Call:
lm(formula = Pct.BF ~ Density + Age + Abdomen, data = bodyfat)

Coefficients:
(Intercept)      Density          Age      Abdomen  
  442.37549   -406.49296      0.01182      0.05761  
  • Fitted model: \[Pct.BF = 442.38 - 406.49 \times Density + 0.01 \times Age + 0.06 \times Abdomen\]
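
To see the fitted equation in use, the sketch below plugs the first row of the data preview (Density 1.0708, Age 23, Abdomen 85.2) into the unrounded coefficients printed above. The function name predict_pct_bf is hypothetical.

```r
# Hypothetical helper: the fitted equation with the unrounded coefficients
predict_pct_bf <- function(density, age, abdomen) {
  442.37549 - 406.49296 * density + 0.01182 * age + 0.05761 * abdomen
}

# First row of the data preview
predict_pct_bf(density = 1.0708, age = 23, abdomen = 85.2)
```

This gives about 12.28, close to the observed Pct.BF of 12.3 for that row.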

Discussion of results

  1. On average, holding the other variables constant, a one-year increase in age is associated with a 0.01 percentage point increase in body fat percentage.
  2. On average, holding the other variables constant, a 1 cm increase in abdomen circumference is associated with a 0.06 percentage point increase in body fat percentage.
  3. On average, holding the other variables constant, a one-unit increase in density is associated with a decrease of 406.49 percentage points in body fat percentage.
  4. When all predictor variables are set to zero, the expected body fat percentage is 442.38%. This intercept has no practical meaning, because a density, age, or abdomen circumference of zero is not feasible, and a percentage above 100 is impossible.
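
Because the predictors are on very different scales, it can help to restate each slope for a more realistic change in its predictor. The increments below (0.01 in Density, 10 years of Age, 10 cm of Abdomen) are illustrative choices, not values from the slides.

```r
# Unrounded slopes from the final fitted model
slopes <- c(Density = -406.49296, Age = 0.01182, Abdomen = 0.05761)

# Illustrative increments (assumed, chosen only for readability)
increments <- c(Density = 0.01, Age = 10, Abdomen = 10)

# Change in Pct.BF (percentage points) for each increment, others held constant
scaled_effects <- round(slopes * increments, 2)
scaled_effects
```

On this scale, a 0.01 increase in density corresponds to about 4.06 percentage points less body fat, ten more years of age to about 0.12 points more, and 10 cm more abdomen circumference to about 0.58 points more.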

Cross Validation

Cross-validated performance of our fitted model, \(Pct.BF = 442.38 - 406.49 \times Density + 0.01 \times Age + 0.06 \times Abdomen\):

Linear Regression 

250 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 224, 226, 226, 225, 226, 225, ... 
Resampling results:

  RMSE       Rsquared   MAE      
  0.9316619  0.9729571  0.4930769

Tuning parameter 'intercept' was held constant at a value of TRUE
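
The output above has the layout printed by the caret package's train(). The same idea can be sketched without any packages: split the rows into 10 folds, fit on nine, predict the held-out fold, and average the errors. The example below is a base R sketch on simulated stand-in data (the real bodyfat data frame is not included here, and the simulation settings and the name cv_one_fold are assumptions), so its numbers will not match the slides.

```r
set.seed(2023)
n <- 250

# Simulated stand-in for bodyfat (NOT the real data)
sim <- data.frame(
  Density = rnorm(n, mean = 1.055, sd = 0.019),
  Age     = sample(22:80, n, replace = TRUE),
  Abdomen = rnorm(n, mean = 92, sd = 10)
)
sim$Pct.BF <- 442.38 - 406.49 * sim$Density + 0.01 * sim$Age +
  0.06 * sim$Abdomen + rnorm(n, sd = 1.26)

# Randomly assign each row to one of k folds of equal size
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on all folds except i, then score predictions on fold i
cv_one_fold <- function(i) {
  train <- sim[fold != i, ]
  test  <- sim[fold == i, ]
  fit   <- lm(Pct.BF ~ Density + Age + Abdomen, data = train)
  err   <- test$Pct.BF - predict(fit, newdata = test)
  c(RMSE = sqrt(mean(err^2)), MAE = mean(abs(err)))
}

cv_results <- sapply(seq_len(k), cv_one_fold)  # 2 x k matrix
rowMeans(cv_results)                           # average RMSE and MAE over folds
```

With the real data, the averaged RMSE and MAE would be the quantities to compare with the 0.93 and 0.49 reported above.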

Summary of Cross Validation

Interpretation for Cross Validation

  • Smaller RMSE and MAE values indicate a better fit to the data.
  • Higher values of R-squared are better.
  • Comparing the full model with the three simple models:
  1. Simple_age and simple_abdomen have relatively higher MAE and RMSE values.
  2. Simple_density has the lowest R-squared value.

Therefore, the full model (\(Pct.BF = 442.38 - 406.49 \times Density + 0.01 \times Age + 0.06 \times Abdomen\)) is the best model.

Conclusions and recommendations

  • The regression analysis shows that Pct.BF, an important indicator for evaluating body composition, is strongly associated with Density and Abdomen (both p < 0.001), with Age making a weaker contribution (p = 0.074).

  • Using Pct.BF as the dependent variable and Density, Age, and Abdomen as the independent variables, the linear regression equation derived from the multiple linear stepwise regression analysis is

\[Pct.BF = 442.38 - 406.49 \times Density + 0.01 \times Age + 0.06 \times Abdomen\]

Cross-validation supports this equation as the best of the models we compared. Its value is that it provides a simple, accurate, and inexpensive method for calculating Pct.BF.

  • This regression equation was fitted to data from men only, so it does not account for sex. This is a shortcoming of the equation, and we hope to extend and improve it through further research.